Summarized in this report is research from a project designed to
investigate  the  utility  of  item  characteristic  curve  theory  and  computerized 
adaptive  testing  as  means  of  measuring  and  reducing  ethnic  bias  and  unfairness 
in  ability  tests.  Included  are  a summary  of  the  research,  conclusions  and 
recommendations, and abstracts of all previous reports. Research in this
project comprised a theory development phase and an application phase. During
the theory development phase, an item characteristic curve theory model
[...] investigated the bias reduction and fairness properties of computerized adaptive
testing.  In  addition,  a methodology  for  detecting  test  item  bias  was 
developed  and  validated.  In  the  application  phase  the  bias  detection 
methodology  was  applied  to  six  sets  of  real  test  data.  In  addition,  the 
bias-reduction  properties  of  computerized  adaptive  testing  were  examined  in 
a live-testing  study  conducted  in  a racially  mixed  high  school.  The  results 
of this research indicate that (1) item characteristic curve theory provides
a viable model for detecting item bias; (2) the incidence of item bias in
existing tests is small, but because of its potential adverse effects, ability
tests should be carefully examined for possible bias; (3) Black students have
different psychological reactions to the conditions of testing than White
students; and (4) computerized adaptive testing can improve ability measurement
for Black students.
Final Report:

Bias-Free  Computerized  Testing 


This  is  the  final  report  of  a project  which  examined  item  characteristic 
curve  theory  and  computerized  adaptive  testing  as  possible  means  of  measuring 
and  reducing  ethnic  bias  in  ability  tests.  The  objectives  of  this  project 
included  the  evaluation  of  bias  in  existing  tests  and  the  exploration  of  the 
potential  of  adaptive  testing  for  improving  ability  measurement  in  minority 
groups.  Included  in  this  report  are  a brief  description  of  the  background 
for  this  research;  the  project  objectives;  and  a summary  of  the  research 
methodology,  major  findings,  and  conclusions.  Also  included  are  abstracts 
of  the  six  Technical  Reports  published  and  a listing  of  all  other  papers 
completed  under  this  project. 

Background


In  recent  years  there  has  been  considerable  controversy  over  the  use  of 
ability  tests  for  personnel  selection  and  placement.  The  focus  of  this 
controversy  is  the  claim  by  members  of  minority  groups  that  ability  tests 
constructed  under  current  procedures  are  biased  against  them  and  therefore 
unfair.  This  has  led  to  a number  of  legal  challenges  in  the  courts,  as  well 
as  to  a search  for  solutions  to  these  problems. 

Since  the  Navy  and  the  other  military  services  use  ability  tests  in  their 
personnel  selection,  placement,  and  classification  activities,  it  is  important 
to  examine  the  extent  and  impact  of  the  possible  bias  that  may  exist  in  their 
ability  tests  and  to  investigate  ways  of  reducing  or  eliminating  it.  In 
addition,  development  of  generalized  methods  for  identifying  and  eliminating 
test  bias  would  have  important  implications  for  other  governmental  agencies 
which  use  tests,  as  well  as  for  test  users  in  industry  and  education. 

Objectives 

The purpose of this contract was to investigate how two recent developments
in psychological measurement could be used for investigating and
eliminating or reducing the differential effects of ability tests on minority
groups. These two developments are item characteristic curve (ICC) theory
and computerized adaptive testing. ICC theory is a new approach to psychological
testing which emerged in the 1960s as a replacement for the traditional
test theories that have been the basis for the construction of ability
tests for over 50 years. Computerized adaptive testing is the application
of  on-line  computers  to  the  administration  of  ability  tests  which  adapt 
themselves  to  individual  differences  in  levels  of  ability  during  the  process 
of  test  administration.  The  basic  advances  in  ICC  theory  and  computerized 
adaptive  testing  are  being  made  through  other  research  contracts  under  the 
support  of  the  Office  of  Naval  Research  Personnel  and  Training  Research 
Programs.  The  present  contract  was  concerned  with  whether  ICC  theory  and 
computerized adaptive testing could be used to improve ability testing for
members  of  minority  groups. 


Approach  and  Major  Results 


The  research  activities  designed  to  address  this  question  were  organized 
into  a theory  development  phase  and  an  application  phase  as  shown  in  Figure  1. 
The  theory  development  phase,  diagrammed  in  Figure  1 above  the  dashed  line, 
had  as  its  purpose  the  definition  of  the  problem  in  operational  terms  and  the 
development  of  a theoretical  base  to  measure  the  relevant  variables.  In  the 
application  phase,  shown  below  the  dashed  line  in  Figure  1,  the  concepts 
developed  in  the  theory  development  phase  were  tested  in  a series  of  empirical 
studies. 


Theory  development  phase.  The  first  step  in  the  theory  development 
phase  was  to  review  the  literature  on  the  definitions  of  terms  and  existing 
methodologies  with  regard  to  test  bias  and  test  fairness  (Research  Report 
76-5).  This  review  led  to  a distinction  between  test  bias  and  test  fairness 
which  had  not  been  clearly  articulated  earlier  in  the  literature.  Test  bias 
was  defined  as  characteristics  of  the  items  constituting  the  test.  Fairness, 
on  the  other  hand,  was  defined  as  a characteristic  of  the  test  itself  and  the 
use to which it is put. Thus, it was possible that a test composed of unbiased
test items could still be used unfairly to discriminate against
members  of  minority  groups.  The  importance  of  this  distinction  is  that  it 
permitted  a division  of  relevant  research  into  two  separable  areas — bias  and 
fairness — and  a clarification  of  the  issues  involved.  Once  the  distinction 
between  bias  and  fairness  is  clearly  understood  by  test  users,  it  should  be 
possible  in  a given  situation  to  define  clearly  whether  it  is  the  test 
itself  that  is  at  fault  (bias)  or  whether  it  is  the  use  to  which  the  test 
scores  are  to  be  put  (fairness)  that  causes  the  undesirable  results. 


In  addition,  this  distinction  served  to  concentrate  effort  separately 
on  the  two  types  of  issues  involved.  Thus,  with  regard  to  test  bias,  the 
distinction  first  led  to  a definition  of  test  bias  phrased  in  terms  of  ICC 
theory.  This,  in  turn,  led  to  a procedure  for  the  detection  of  bias  in 
test  items. 


With  regard  to  test  fairness,  the  ICC  definition  of  test  bias  had 
implications  for  a series  of  computer  simulation  studies  on  the  effects  of 
item  bias  and  test  strategy  on  test  fairness  (Research  Reports  76-5  and 
78-1).  These  studies  varied  three  major  variables:  (1)  characteristics 
of  a Bayesian  adaptive  testing  strategy  (Research  Report  78-1),  (2)  the 
effects  of  item  characteristics  on  test  fairness  (Research  Report  76-5), 
and  (3)  the  interaction  of  item  characteristics  and  testing  strategy 
(Research Report 78-1). A general conclusion drawn from the simulation
studies, based on the models developed in this project, was that computerized
adaptive testing could be designed to take into account the bias existing in
test items in such a way that the unfairness of resultant applications of test
scores would be considerably reduced relative to that from conventional tests. Thus,
the simulation studies showed that computerized adaptive testing, in conjunction
with the ICC definition of test bias and the methodologies for its
detection which were developed in this project, could result in fairer tests.


Application  phase.  The  methodologies  developed  in  the  theory  development 
phase  were  then  applied  in  empirical  studies  in  the  application  phase.  These 
activities  followed  the  basic  distinction  between  bias  and  fairness  developed 
in  the  theory  development  phase.  With  regard  to  item  bias,  the  bias  detection 
methodology  developed  earlier  in  the  project  was  validated  and  was  applied  to 
several  sets  of  real  test  data. 

The  question  in  the  validation  phase  was  whether  or  not  it  was  possible 
to  use  the  methodology  developed  to  detect  items  which  were  known  to  be 
biased.  To  investigate  this  question  (Research  Report  78-3),  a test  was 
purposely  constructed  which  consisted  of  some  biased  items;  this  test  was 
administered to groups of differing racial composition. The data analysis was
concerned with determining whether the methods developed in the theory development
phase were able to identify as biased those items which were known to
be  heavily  biased.  The  test  was  a vocabulary  test  consisting  of  127  items; 
one-third  of  the  test  items  were  written  to  be  biased  in  favor  of  Black 
students.  These  items  were  multiple-choice  items  in  which  the  correct  answer 
was  a definition  indigenous  to  the  Black  culture  that  would  not  be  common 
knowledge  to  White  students;  the  remainder  of  the  response  alternatives  were 
definitions  which  would  be  correct  in  neither  culture.  Similarly,  one-third 
of  the  words  in  the  test  were  biased  in  favor  of  White  students.  These  were 
test items which would be predominantly known in the White culture and not
in  the  Black  culture.  The  rest  of  the  words  in  the  test  were  standard 
vocabulary  items  taken  from  a pool  of  600  vocabulary  test  items  used  in 
adaptive  testing  research  at  the  University  of  Minnesota. 

The  results  of  this  study  showed  that  the  methodology  developed  to  detect 
bias  correctly  identified  a portion  of  the  a priori  biased  items  for  both 
Black  students  and  White  students.  The  most  strongly  biased  items  in  this 
analysis  are  shown  in  Table  1.  The  three  most  strongly  biased  items  against 
White  students  were  "shouting,"  "fry,"  and  "African  dominoes";  and  those 
most  strongly  biased  against  Black  students  were  "cameo"  and  "lox."  In  each 
case,  the  definition  of  bias  was  based  on  the  fact  that  White  students  (or 
Black  students)  performed  more  poorly  on  these  test  items  than  did  members  of 
the  other  group. 


Table 1

Biased Items Identified as Biased
by the ICC-Based Procedure

Item                  Correct Answer

Items Biased Against Whites
  shouting            in religious sense
  fry                 to curl one's hair
  African dominoes    dice game

Items Biased Against Blacks
  cameo               gem carved in relief
  lox                 smoked salmon

Given  the  validation  of  the  bias  detection  methodology  as  a result  of 
this  live-testing  application,  the  methodology  was  applied  to  a number  of 
other data sets (Research Reports 77-1, 78-3, and 78-5), as summarized in
Figure  2.  Application  of  the  methodology  requires  two  sets  of  data  on  a 
majority  and  a minority  group.  The  data  are  factor  analyzed,  and  if  one 
dominant factor appears for each of the two groups, the process continues.
If more than one factor is detected in either group, other methods are required
to answer the question. For those data sets in which one factor
exists,  the  procedure  continues  by  splitting  the  majority  group  into  two 
subgroups — J1  and  J2.  ICC  item  parameterization  methods  are  then  used  to 
estimate the difficulty (b) parameter for both of the majority subgroups and
for  the  minority  group. 
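
To illustrate why a difference in the difficulty (b) parameter between groups signals possible item bias, the following minimal sketch evaluates an item characteristic curve for two examinees of equal ability. It assumes a two-parameter logistic form, which the report does not specify; the function name and all parameter values are hypothetical.

```python
import math

def icc_probability(theta, a, b):
    """Probability of a correct response under an assumed two-parameter
    logistic ICC (the report does not state which ICC model was used)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical difficulty estimates for the same item in each group.
b_majority = 0.0
b_minority = 1.0

# Two examinees with the same ability, answering an item with the same
# discrimination; only the group-specific difficulty differs.
theta, a = 0.5, 1.0
p_majority = icc_probability(theta, a, b_majority)
p_minority = icc_probability(theta, a, b_minority)

print(f"majority: {p_majority:.3f}, minority: {p_minority:.3f}")
```

Under these assumed values the item is harder for the minority examinee even though both examinees have the same ability, which is the pattern the difficulty comparison is meant to detect.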

The  resulting  values  are  compared  by  a statistical  methodology  developed 
in  this  research  to  determine  whether  or  not  some  of  the  items  in  the  test 
are biased. Two outcomes may result from this analysis. Either the items
will be found to be biased, or they will not. If no items are found to be
biased, then the factors obtained in the two groups are compared; if the
factors are comparable, the test items can be said to be unbiased. If the
factors  are  not  found  to  be  comparable,  this  may  indicate  that  there  is  a 
constant  degree  of  bias  in  all  the  items  or  that  the  test  measures  different 
dimensions  for  the  two  racial  groups. 

If  some  items  are  biased,  the  question  to  be  raised  is  whether  the  items 
are  reliably  biased.  This  is  studied  by  a comparison  of  the  item  bias  values 
for  each  of  the  majority  subgroups  versus  the  minority  group,  which  then 
leads  to  a conclusion  of  either  unreliably  biased  or  reliably  biased  items. 

If  the  items  are  reliably  biased,  the  question  of  the  comparability  of 
factors  is  investigated  by  comparing  the  factors  in  the  two  groups.  Depending 
on the outcome of this comparison, it can be concluded that (1) the test
measures the same thing for both groups, but with some biased items, or
that (2) the test is biased on different dimensions.
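
The decision sequence described above can be sketched as a small function. This is only a restatement of the branching logic in the text; the function name, argument names, and returned strings are hypothetical, and the statistical tests that would produce each yes/no answer are not shown.

```python
def bias_analysis_conclusion(single_factor_in_each_group,
                             any_items_biased,
                             bias_is_reliable,
                             factors_comparable):
    """Map the yes/no outcomes of each analysis step to a conclusion."""
    # The procedure needs one dominant factor in both groups.
    if not single_factor_in_each_group:
        return "other methods required"
    if not any_items_biased:
        # No biased items: the conclusion rests on factor comparability.
        if factors_comparable:
            return "items unbiased"
        return "constant bias in all items, or different dimensions measured"
    # Some items look biased: check agreement across majority subgroups.
    if not bias_is_reliable:
        return "unreliably biased items"
    if factors_comparable:
        return "same dimension measured, with some biased items"
    return "test biased on different dimensions"

print(bias_analysis_conclusion(True, True, True, True))
```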

This  methodology  was  subsequently  applied  to  seven  different  tests  to 
determine degrees of bias in those test items. The tests included the Gates
Reading  Test  (a  test  used  in  elementary  and  high  schools),  the  Navy  Enlisted 
Advancement  Examinations  for  Boiler  Technician  and  Advanced  Machinists  Mate, 
the  verbal  and  quantitative  sections  of  the  School  and  College  Aptitude 
Tests  (SCAT  II;  Research  Report  78-5),  and  ability  tests  developed  for  this 
research  at  the  University  of  Minnesota  (Research  Reports  77-1  and  78-3). 

The  results  shown  in  Table  2 indicate  that  there  were  very  low  levels  of 
bias in the majority of the tests, using the methodology developed. The test
with the highest degree of bias was the one discussed above, which was explicitly
developed to have large numbers of biased items. Each of the remaining
tests,  with  the  exception  of  the  Navy  Enlisted  Advancement  Examinations,  was 
found  to  have  two  or  three  biased  items.  These  results  imply  that  there  are 
a small  number  of  biased  items  on  some  ability  tests,  and  care  should  be 
taken  to  screen  items  in  ability  tests  in  order  to  remove  items  which  display 
subgroup  biases. 

The  results  shown  in  Table  2 indicate  that  the  Navy  Enlisted  Advancement 
Examinations  were  not  completely  analyzed.  These  tests  were  tests  of 
achievement  and  were  found  to  be  highly  multidimensional.  Consequently,  they 
did  not  meet  the  single-factor  criterion  required  by  the  bias  measurement 
methodology.  More  research  is  needed  to  develop  methods  for  the  detection 
of  bias  in  achievement  tests. 


Table 2

Summary of the Extent of Bias Found in Seven Sets of Test Data

                                                             Sample Size        Number of Items
Test                            Type                      Minority  Majority    Total    Biased

Gates Reading Test              Reading Test                 261       578         50        2
Navy Enlisted Advancement Exam  Boiler Technician             79       498        150        *
Navy Enlisted Advancement Exam  Advanced Machinists Mate      47       656        150        *
SCAT II                         Verbal                       129       251         45        2
SCAT II                         Quantitative                 129       251         45        3
U of Minnesota Ability Test     Vocabulary                    58       168         75        2
"Biased Test"**                 Vocabulary                    92       173        127       12

* Bias analysis could not be applied due to multidimensionality of test items.

** This was the validation test discussed above.


The  second  part  of  the  application  phase  of  this  project  was  a live- 
testing  study  which  compared  strategies  of  computerized  adaptive  testing 
designed  to  reduce  test  bias  with  conventional  tests  typically  used  to  measure 
verbal  ability  (Research  Report  79-2).  In  addition  to  studying  the  specially 
designed  bias-reduction  properties  of  adaptive  testing,  a variable  found  in 
a related project was studied to determine its effect on test performance.
This variable was the effect of immediate knowledge of results on the ability
test  performance  of  Black  and  White  high  school  students.  Additional 
dependent  variables  in  this  study  were  the  reactions  of  the  students  to  the 
test-taking  conditions.  The  results  of  this  study  showed  that  Black  students 
reacted differently than White students to the conditions of testing, specifically
to the provision of immediate knowledge of results and the mode of
test  administration.  The  Black  students  were  also  more  motivated  by  the 
adaptive  tests  than  by  the  conventional  tests.  The  ability  data  showed  that 
the  bias-reduced  tests  eliminated  mean  racial  differences  in  ability  estimates 
when  these  tests  were  administered  without  knowledge  of  results.  Thus,  it  is 
relevant  to  consider,  not  only  the  items  themselves  in  terms  of  their  bias, 
but  the  conditions  and  strategies  of  test  administration  as  well,  in  an 
attempt  to  reduce  the  adverse  effects  of  ability  tests  on  the  scores  and 
performance  of  members  of  minority  groups. 

Conclusions 


This  was  the  first  research  project  in  which  item  characteristic  curve 
theory  and  computerized  adaptive  testing  were  investigated  as  means  of 
improving  ability  tests  for  minorities.  Based  on  the  findings  of  this 
project,  it  appears  that  item  characteristic  curve  theory  and  computerized 
adaptive  testing,  used  either  singly  or  jointly,  are  viable  means  of 
accomplishing  this  objective. 

Seven  tests,  including  the  Navy  Enlisted  Advancement  Examination,  were 
examined for bias using a methodology based on ICC theory developed in this
project. On the average, about 4% of the test items examined were found to be
biased.  Although  this  is  a relatively  small  amount  of  bias,  it  could  lead  to 
a relatively  large  number  of  individuals  being  discriminated  against  in  a large- 
scale  testing  program.  Therefore,  methods  such  as  the  one  developed  in  this 
project  should  be  used  regularly  during  the  earlier  stages  of  test  development 
to  screen  out  biased  items. 
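
The report does not state how the figure of about 4% was computed. As one possible reading, the sketch below pools the item counts from Table 2 for the four tests that were fully analyzed, leaving out the two Navy examinations (which could not be analyzed) and the purposely biased validation test. The choice of which tests to pool is an assumption, and the variable names are hypothetical.

```python
# (test, total items, items flagged as biased), copied from Table 2.
analyzed = [
    ("Gates Reading Test", 50, 2),
    ("SCAT II Verbal", 45, 2),
    ("SCAT II Quantitative", 45, 3),
    ("U of Minnesota Ability Test (Vocabulary)", 75, 2),
]
# The purposely biased validation test, kept separate (assumption).
validation = ("Biased Test", 127, 12)

def pooled_rate(rows):
    """Proportion of biased items across all rows, pooling item counts."""
    total_items = sum(n_items for _, n_items, _ in rows)
    total_biased = sum(n_biased for _, _, n_biased in rows)
    return total_biased / total_items

rate_without_validation = pooled_rate(analyzed)
rate_with_validation = pooled_rate(analyzed + [validation])

print(f"{rate_without_validation:.1%} excluding the validation test")
print(f"{rate_with_validation:.1%} including the validation test")
```

Pooling only the four fully analyzed tests gives 9 biased items out of 215, which is close to the reported 4%; including the validation test raises the pooled rate.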

The  potential  of  adaptive  testing  for  reducing  bias  and  test  unfairness 
was  explored  by  using  computer  simulations  of  one  adaptive  testing  procedure, 
as  well  as  by  the  administration  of  actual  computerized  adaptive  tests  in  a 
public  high  school.  The  general  conclusion  drawn  from  the  simulation  studies 
was  that  adaptive  tests,  because  of  their  ability  to  tailor  item  administration 
to  the  individual  being  tested,  have  the  potential  to  be  more  reliable  and  fair 
for  members  of  minority  groups  than  conventional  tests.  This  general  finding 
was  further  explored  in  the  live-testing,  as  opposed  to  simulated-testing, 
phase  of  this  project. 

The  live-testing  phase,  conducted  in  a racially  mixed  public  high  school, 
compared  several  adaptive  and  several  paper-and-pencil  tests  of  verbal  ability. 
In addition, the effect of immediate knowledge of results was examined.
The results of this study supported earlier research in showing that Black
students had different psychological reactions than White students to the conditions
of testing, specifically to the provision of immediate knowledge of
results  and  the  mode  of  test  administration  (computerized  versus  paper-and- 
pencil).  The  data  also  showed  that  under  certain  conditions,  the  bias-reduced 
tests  eliminated  mean  racial  group  differences  in  ability  estimates. 

In  addition,  evidence  was  found  in  this  research  program  to  support  the 
idea  that  computerized  adaptive  testing  can  improve  ability  measurement  for 
Black  students  in  several  ways.  Finally,  the  overall  results,  both  for  Black 
and  White  students,  added  to  the  growing  body  of  evidence  which  indicates  the 
general  superiority  of  computerized  adaptive  testing  over  conventional  paper- 
and-pencil  testing  in  the  measurement  of  abilities. 
ABSTRACTS OF RESEARCH REPORTS


Research  Report  76-5 

Effects  of  Item  Characteristics  on  Test  Fairness 

Steven  M.  Pine  and  David  J.  Weiss 
December  1976 

This report examines how selection fairness is influenced by the item characteristics
of a selection instrument in terms of its distribution of item
difficulties, level of item discrimination, and degree of item bias. Computer
simulation was used in the administration of conventional ability tests
to  a hypothetical  target  population  consisting  of  a minority  and  a majority 
subgroup.  Fairness  was  evaluated  by  three  indices  which  reflect  the  degree 
of  differential  validity,  errors  in  prediction  (Cleary's  model),  and  proportion 
of  applicants  exceeding  a selection  cutoff  (Thorndike's  model).  Major  findings 
were  that  (1)  tests  with  a uniform  distribution  of  difficulties  had  fairness 
properties  generally  superior  to  tests  having  a peaked  distribution  of  item 
difficulties;  (2)  subgroup  validity  differences  can  be  expected  to  occur  when 
test  items  are  biased  against  one  of  the  subgroups;  (3)  when  differential 
prediction  is  used,  the  Thorndike  model  reflects  varying  degrees  of  unfairness 
due  to  item  bias  and  other  test  characteristics,  while  the  Cleary  and  validity 
models  do  not;  (4)  differential  prediction  provides  fairer  selection  than  the 
use  of  majority  prediction  only,  regardless  of  the  internal  characteristics  of 
the  test,  although  substantial  degrees  of  unfairness  still  exist  under  certain 
test  item  configurations.  It  was  concluded  that  the  internal  characteristics 
of  a selection  instrument  will  affect  the  fairness  of  test  scores  in  specific 
applications  and  that  further  research  is  needed  to  delineate  which  testing 
strategies  and/or  item  characteristics  are  optimal  in  reducing  unfairness. 


Research  Report  77-1 

Applications  of  Item  Characteristic  Curve  Theory  to  the  Problem  of  Test  Bias 

Steven  M.  Pine 

In  David  J.  Weiss  (Ed.),  Applications  of  Computerized  Adaptive  Testing 

March  1977 

It  is  argued  that  a major  problem  in  current  efforts  to  develop  less  biased 
tests  is  an  over-reliance  on  classical  test  theory.  Item  characteristic 
curve  (ICC)  theory,  which  is  based  on  individual  rather  than  group-oriented 
measurement,  is  offered  as  a more  appropriate  measurement  model.  A definition 
of  test  bias  based  on  ICC  theory  is  presented.  Using  this  definition,  several 
empirical  tests  for  bias  are  presented  and  demonstrated  with  real  test  data. 
Additional  applications  of  ICC  theory  to  the  problem  of  test  bias  are  also 
discussed.


Research  Report  78-1 

A Comparison  of  the  Fairness  of  Adaptive  and  Conventional  Testing  Strategies 

Steven  M.  Pine  and  David  J.  Weiss 
August  1978 

This report examines how selection fairness is influenced by the characteristics
of a selection instrument in terms of its distribution of item
difficulties, level of item discrimination, degree of item bias, and testing
strategy. Computer simulation was used in the administration of either a
conventional or a Bayesian adaptive ability test to a hypothetical target
population  consisting  of  a minority  and  a majority  subgroup.  Fairness  was 
evaluated  by  three  indices  which  reflect  the  degree  of  differential  validity, 
errors  in  prediction  (Cleary's  model),  and  proportion  of  applicants  exceeding 
a selection  cutoff  (Thorndike's  model).  Major  findings  were  (1)  when  used 
in  conjunction  with  either  the  Bayesian  adaptive  or  the  conventional  test, 
differential  prediction  increased  fairness  and  facilitated  the  interpretation 
of  the  fairness  indices;  (2)  the  Bayesian  adaptive  tests  were  consistently 
fairer than the conventional tests for all item pools above the a = .7 discrimination
level for tests of more than 30 items; (3) the differential
prediction  version  of  the  Bayesian  adaptive  test  produced  almost  perfectly 
fair  performance  on  all  fairness  indices  at  high  discrimination  levels;  and 
(4)  the  placement  of  subgroup  prior  distribution  in  the  Bayesian  adaptive 
testing  procedure  can  affect  test  fairness. 
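
The abstract does not describe the Bayesian adaptive strategy in detail. As a rough illustration of the general idea, and of why the placement of the prior distribution can matter, the sketch below keeps a posterior over ability on a grid, administers the unused item whose difficulty is closest to the current estimate, and updates the posterior after each response. The logistic ICC, the grid approximation, the item-selection rule, and every name and value here are assumptions, not the procedure used in the simulations.

```python
import math

def icc(theta, a, b):
    """Assumed two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def posterior_mean(grid, weights):
    return sum(t * w for t, w in zip(grid, weights))

def update(grid, weights, a, b, correct):
    """Multiply the posterior by the response likelihood and renormalize."""
    new = []
    for t, w in zip(grid, weights):
        p = icc(t, a, b)
        new.append(w * (p if correct else 1.0 - p))
    total = sum(new)
    return [w / total for w in new]

def adaptive_test(pool, answer, n_items, prior_mean=0.0, prior_sd=1.0):
    """pool: list of (a, b) pairs; answer(i) -> True if item i is correct."""
    grid = [x / 10.0 for x in range(-40, 41)]  # ability values -4.0 .. 4.0
    weights = [math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2) for t in grid]
    total = sum(weights)
    weights = [w / total for w in weights]
    used = []
    for _ in range(n_items):
        estimate = posterior_mean(grid, weights)
        # Pick the unused item whose difficulty is nearest the estimate.
        candidates = [i for i in range(len(pool)) if i not in used]
        i = min(candidates, key=lambda j: abs(pool[j][1] - estimate))
        used.append(i)
        a, b = pool[i]
        weights = update(grid, weights, a, b, answer(i))
    return posterior_mean(grid, weights), used

# Hypothetical five-item pool with equal discriminations.
pool = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
est_high, used_high = adaptive_test(pool, lambda i: True, 3)
est_low, used_low = adaptive_test(pool, lambda i: False, 3)
# Same responses, but a prior centered one unit lower.
est_shifted, _ = adaptive_test(pool, lambda i: True, 3, prior_mean=-1.0)
print(est_high, est_low, est_shifted)
```

With identical responses, the run that starts from the lower prior ends with a lower ability estimate, which is one simple way a subgroup prior could influence the fairness of the resulting scores.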

Research  Report  78-3 

A Comparison  of  Levels  and  Dimensions  of  Performance  in  Black  and  White 
Groups  on  Tests  of  Vocabulary,  Mathematics,  and  Spatial  Ability 

Austin  T.  Church,  Steven  M.  Pine,  and  David  J.  Weiss 

October  1978 

The  nature  and  extent  of  ability  test  performance  differences  between  Black 
and  White  high  school  students  on  vocabulary,  mathematics,  and  spatial  ability 
tests  were  examined.  Mean  differences  on  total  test  scores  were  found  for  all 
three  tests,  with  Whites  averaging  higher  than  Blacks.  In  the  vocabulary  test, 
however,  this  effect  could  not  be  interpreted  independently  of  sex  and  parents' 
educational  level.  Parents'  educational  levels  were  significantly  related  to 
performance  on  the  vocabulary  and  spatial  tests;  in  the  vocabulary  test 
parental  education  interacted  with  the  race  and  sex  variables.  Separate 
factor  analyses  were  performed  for  the  Black  and  White  groups  to  determine  the 
number  and  nature  of  dimensions  underlying  performance  for  each  group.  While 
the  number  of  factors  needed  to  account  for  the  common  item  variance  in  each 
test  was  the  same  for  Blacks  and  Whites,  items  defining  each  factor  and  the 
correlations  of  factors  across  the  three  tests  indicated  that  the  nature  of 
the  factors  was  different  for  the  two  groups.  For  the  vocabulary  test, 
degree of item bias was evaluated in terms of the difference in item difficulties
for Blacks and Whites as indexed by the difficulty (b) parameter
of item characteristic curve (ICC) theory. Comparison of the ICC item
parameters for the Blacks and the Whites showed differences in both difficulties
and discriminations. By comparing the index of item bias with the
vocabulary  factor  structures  in  both  groups,  a "bias"  factor  defined  by 
"Black-type"  words  was  identified  in  the  White  group.  Analysis  of  racial 
group  differences  in  relationships  among  subtest  scores  and  factor  scores 
showed  that  Whites  had  more  common  variance  among  subtests  than  Blacks,  with 
the  largest  differences  occurring  where  the  vocabulary  test  was  involved.  It 
was  concluded  that  when  the  factor  structures  underlying  ability  tests  differ 
sufficiently  for  two  or  more  racial  groups,  the  meaning  of  mean  group 
performance  differences  becomes  less  clear.  Investigation  of  the  fairness 
of  psychometric  tests  should  include  examination  of  possible  bias  at  both 
item  and  factor  levels. 


Research Report 78-5

An  Item  Bias  Investigation  of  a Standardized  Aptitude  Test 

John  T.  Martin,  Steven  M.  Pine,  and  David  J.  Weiss 
December  1978 

Verbal  and  quantitative  data  from  a standardized  aptitude  test  (SCAT,  Series 
II,  Level  2)  were  analyzed  separately  for  Native  American  and  White  high 
school  students.  Item  correlation  matrices  were  factor  analyzed  for  each 
group,  separately  for  each  ability.  Coefficients  of  congruence  comparing 
factor  structures  between  groups  were  high  for  the  first  verbal  factor  and 
the  first  and  second  quantitative  factors,  implying  that  ability  factor 
structures  were  similar  for  the  two  groups.  The  first  factors  were  of 
sufficient  size  to  allow  parameterization  of  the  items  by  item  characteristic 
curve (ICC) methods. Item difficulty (b) parameters derived for the two
groups  were  compared  by  regressing  difficulty  parameters  for  the  Native 
American  group  on  the  difficulty  parameters  for  the  White  group,  and  values 
of  elliptic-D  were  computed  for  each  item  and  group.  Results  led  to  the 
conclusion  that  there  were  no  reliably  biased  items  in  the  verbal  subtest, 
while  there  were  two  reliably  biased  items  in  the  quantitative  subtest — 
one  item  biased  against  the  Native  American  group  and  one  item  biased 
against  the  White  group.  Internal  consistency  reliabilities  were  higher  for 
the  Native  American  group  in  both  tests,  and  the  scores  of  the  Native  American 
students  were  better  predictors  of  high  school  rank  than  were  scores  for 
the  White  students;  but  these  results  were  significant  (p<.05)  only  for  the 
quantitative  subtest.  Results  indicated  that  different  approaches  to  the 
identification  of  bias  led  to  different  conclusions.  Thus,  additional 
research  is  needed  to  determine  which  indices  of  item  and  test  bias  yield 
the  most  meaningful  approach  to  the  analysis  of  bias  in  ability  tests. 
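
The abstract does not give the formula used for the coefficients of congruence. A common choice is Tucker's congruence coefficient, the normalized sum of cross-products of two factor loading vectors; the sketch below assumes that definition, and the loadings shown are hypothetical.

```python
import math

def congruence(loadings_x, loadings_y):
    """Tucker's congruence coefficient between two loading vectors
    (assumed here; the report does not state which coefficient was used)."""
    if len(loadings_x) != len(loadings_y):
        raise ValueError("loading vectors must have the same length")
    cross = sum(x * y for x, y in zip(loadings_x, loadings_y))
    norm = math.sqrt(sum(x * x for x in loadings_x) *
                     sum(y * y for y in loadings_y))
    return cross / norm

# Hypothetical first-factor loadings for the same five items in two groups.
group_one = [0.62, 0.55, 0.70, 0.48, 0.66]
group_two = [0.58, 0.60, 0.65, 0.50, 0.61]
print(round(congruence(group_one, group_two), 3))
```

Values near 1 indicate that the two groups' factors are defined by the same items in similar proportions, which is the sense in which the factor structures are described as similar.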

Research  Report  79-2 

Effects  of  Computerized  Adaptive  Testing  on  Black  and  White  Students 

Steven M. Pine, Austin T. Church, Kathleen A. Gialluca, and David J. Weiss

March  1979 

Bias-reduced and non-bias-reduced conventional paper-and-pencil and computerized
adaptive tests of word knowledge were administered to Black and White
high  school  students  to  study  differential  effects  on  ability  estimates  and 
psychological  reactions.  Independent  variables  examined  were  bias  reduction, 
the  presence  or  absence  of  knowledge  of  results  after  each  item,  mode  of 
administration (paper-and-pencil or computerized adaptive), order of administration,
and race. Dependent variables were three test performance variables
(the ability estimates derived from both conventional paper-and-pencil and
computerized  adaptive  tests,  the  variance  of  those  estimates,  and  the  number 
of  omitted  responses)  and  four  psychological  reaction  variables  (reaction  to 
knowledge  of  results,  nervousness,  motivation,  and  guessing).  Bias-reduced 
tests  were  specially  constructed  from  items  which  had  previously  been  shown 
to  be  less  biased  towards  Black  students  in  terms  of  an  item  bias  index 
derived from item characteristic curve (ICC) theory. The bias-reduced tests
eliminated mean racial differences between Black and White students under
certain  test  conditions,  but  the  effect  interacted  with  other  conditions  of 
test  administration,  e.g.,  whether  or  not  knowledge  of  results  was  provided. 
Since the bias-reduced tests provided less precise measurement than the non-bias-reduced
tests, it was concluded that more traditional item statistics,
such  as  item  discriminations,  should  be  considered  along  with  an  index  of  item 
bias  in  test  construction.  Computerized  adaptive  tests  were  generally  shown 
to be more motivating than the conventional paper-and-pencil tests. Black
students,  in  particular,  seemed  to  be  less  tolerant  of  the  conventional 
paper-and-pencil  tests,  especially  when  taken  after  the  adaptive  tests.  This 
was  reflected  in  levels  of  reported  motivation,  number  of  omitted  responses, 
and  reported  amounts  of  guessing.  Differential  psychological  reactions  for 
Black and White students were found for other conditions of test administration
as well; however, the computer-administered adaptive tests appeared
to  reduce  these  differences  in  comparison  to  the  conventional  paper-and-pencil 
tests.  These  data  imply  the  need  for  further  study  of  the  effects  of  test 
administration  conditions  on  members  of  minority  groups  to  determine  those 
administration  conditions  which  maximize  ability  estimates  either  directly  or 
through  their  effects  on  the  psychological  environment  of  testing. 